We present a comparison of two English texts written by Lewis Carroll,
Alice in wonderland and Through a looking glass, the former also
translated into Esperanto, in order to observe whether natural and
artificial languages significantly differ from each other. We construct
one-dimensional, time-series-like signals using either word lengths or
word frequencies. We use multifractal ideas for sorting out correlations
in the writings. In order to check the robustness of the methods we also
write (!), or rather consider, the corresponding shuffled texts. We
compare characteristic functions and, e.g., observe marked differences in
the (far from parabolic) f(α) curves, differences which we attribute to
Tsallis non-extensive statistical features in the frequency time series
and length time series. The Esperanto text has more extreme values. A
very rough approximation consists in modeling the texts as a random
Cantor set, as if resulting from a binomial cascade of long and short
words (or words and blanks). This leads to parameters characterizing the
text style, and most likely in fine the author's writings.

PACS numbers: 



As soon as modern fractals appeared in order to describe physical
objects, it was evident that some generalization was in order:
multifractals sprang up, e.g., since obviously a single fractal dimension
D is not enough to describe an object [1] [5]. The more so in
non-equilibrium systems, characterized by some unusual dynamics. Through
a generator and from an initiator one can produce a fractal object with a
given dimension. How to produce realistic and meaningful multifractal
models is a challenge. Do they really exist? Do multifractal models exist
nowadays [4]? These questions come in parallel with the measurement of
the fractal dimension, ... and its distribution. One question of interest
is whether the apparently multifractal nature of an object is due to its
finite size, to a complex dynamical feature, or to something else. Some
attempt in this direction results from the observation of multifractal
features in meteorology and climate studies [51017], but also in many
other fields [S], like mathematical finance [4] [9] [10] [11] [12] [13].

Let us recall that one basically has to obtain a D(q) function or the
f(α) spectrum, where q represents the degree of some moment of the
distribution of some variable, and α is some sort of critical exponent at
phase transitions, also called the Hölder exponent; f(α) being its
distribution.

There is a need for experimental work leading to reliable D(q) and f(α)
data, before modeling. Interesting pioneering data should be recalled
here: see work on DLA [Ulin], DNA [TS1II7], SOI [H], NAO [S], .... It
appears that most of the time some "signal" is either directly a time
series or is transformed into a time series; more generally, the signal
is called a text, because it can be decomposed through level thresholds
which can be thought of as a set of characters taken from an alphabet.
Here below we take real texts as the source of experimental observations
and follow the multifractal ideas to make an analysis of such texts. The
main question concerns whether multifractals are indeed found in real
texts, a question raised in [20]; another is whether the technique of
analysis can give some insight on a logical construction [21], from which
stems the possible connection of such ideas with coding, transcription,
machine translation, social distances, network properties of languages,
... [21] [22] [23] and, more classically in physics, with identifying
coherent structures in spatially extended systems [24]. We examine two
classical English texts but also one translation, into an unusual
language, and the corresponding shuffled texts. We focus on how
local/structural properties develop into global ones.

Since Shannon [25], writings and codings are of interest in statistical
physics. Writings are systems practically composed of a large number of
internal components (the words, signs, and blanks in printed texts). In
terms of complexity investigations, writings, which are a form of
recorded languages, belong, like living systems, to the top level class
involving highly optimized tolerance design, error thresholds in optimal
coding, and financial markets [5T]. Relevant questions pertain to the
life time, concentration, distribution, ... complexity of these. One
should distinguish two main frameworks. On one hand, language
developments seem to be understandable through competitions, like in
Ising models, and in self-organized systems. Their diffusion seems
similar to percolation and nucleation-growth problems, taking into
account the existence of different time scales for inter- and intra-
effects; this is the realm of anthropology. The second framework
originates from more classical linguistics studies; it pertains to the
content and meanings of words and texts. Concerning the internal
structure of a text, supposedly characterized by the language in which it
is written, it is well known that a text can be mapped into a signal, of
course through the alphabet characters. However it can also be reduced to
fewer symbols through some threshold, like a time series, which can be a
list of +1 and -1, or sometimes 0. In fact, laws of text content and
structure have long been searched for by Zipf and others, see many refs.
in [55], through the least effort (so called ranking) method. The
technique is now currently applied in statistical physics as a first step
to obtain, when it exists, the primary scaling law. Long range order
correlations (LROC) between words in texts are also searched for. In [57]
it was claimed that LROC express an author's ideas, and in fine
constitute some author's signature.

Interestingly, writings can be thought of as social networks [23]. Social
networks have fractal properties [28]: most usually they should be
multifractals; one can thus imagine/consider that a text, which is a form
of partially self-organized social network (for words) due to grammatical
and style constraints, presents multifractal features. The properties of
such texts taken as signals have already been examined, e.g., a
multifractal analysis of the Moby Dick letter distribution can be found
in [20].

Even though we recognize such a pioneering paper, we stress that it is
sentences made of words, not letters, that are translated. Thus we
present below an original consideration in this respect, i.e. the
analysis of, and results about, a translation between one of the most
commonly used languages, i.e. English, and a relatively recent language,
i.e. Esperanto. Esperanto is an artificially constructed language P5],
which was intended to be an easy-to-learn lingua franca. Statistical
analyses seem to indicate that Esperanto's statistical proportions are
similar to those of other languages [SlT. It was found that Esperanto's
statistical proportions resemble mostly those of German and Spanish, and
somewhat surprisingly least those of French and Italian. English seems to
be the intermediary case.

Comparisons of different languages (writings) arising from apparently
different origins or containing different signs, e.g. Greek [31], Turkish
[32], Chinese [33], ... even somewhat artificial languages, like those
used for simulation codes on computers [31], have also been made. To our
knowledge few comparisons have been presented about written texts
translated from one language to another [25] [55], in particular from the
point of view of LROC in words.

The text used here was chosen for its wide diffusion, being freely
available from the web [37], and as a representative one of a famous
scientist, Lewis Carroll, i.e. Alice in wonderland (AWL) [38]. Moreover,
knowing the special (mathematical) quality of this author's mind and, as
we thought a priori, some possibly special way of writing, another text
has been chosen for comparison, i.e. Through a looking glass (TLG) [39],
to our knowledge only available in English (on the web). This will allow
us to discuss whether the differences, if any, between Esperanto and
English are apparently due to the translation or to the specificity of
this author's work. Previous work on the English AWL version (AWLeng)
should be mentioned [ID], but pertains to a mere Zipf analysis.

In Sect. I we present some elementary facts on these texts and briefly
expose the methodology, i.e. we emphasize that we distinguish between
"frequency time series" (FTS) and "length time series" (LTS). We recall
the multifractal technique for this specific application. In order to
check the robustness of the method we also invent (write !) (or consider)
the corresponding shuffled texts, to which we apply the same technique of
analysis. In Sect. II we present the somewhat unusual results, and
discuss them in Sect. III. For the simplicity of the discussion we very
roughly approximate the texts as resulting from a binomial cascade of
short and long words. We obtain parameters in fine characterizing the
text style and the author's writings. We observe that such multifractals
have a deep connection to Tsallis non-extensive statistics, as pointed
out in ref. [41] in another framework.



I. DATA AND METHODOLOGY 

For our empirical considerations we have selected the two texts mentioned
here above, downloading them from a freely available site [37], resulting
obviously in three files, referred to below as AWLeng, AWLesp, and
TLGeng. Next, we have removed the chapter heads. All our analyses are
carried out on this reorganized file for each text. Thereafter, we have
shuffled these texts, in the files, without taking into account the
punctuation [42].
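
The shuffling step can be illustrated by a minimal Python sketch. It is not the procedure actually used for the paper: the function name `shuffle_words`, the definition of a word as a run of letters, the lower-casing, and the fixed seed are all assumptions made for the illustration.

```python
import random
import re

def shuffle_words(text, seed=0):
    """Return the words of `text` in random order, punctuation ignored."""
    # Assumption: a word is a maximal run of letters, with inner apostrophes
    # kept; accented letters (e.g. in Esperanto) are matched as letters too.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    shuffled = list(words)
    rng.shuffle(shuffled)
    return shuffled

sample = "Alice was beginning to get very tired of sitting by her sister."
print(shuffle_words(sample))
```

Shuffling keeps the word inventory, hence the word length and frequency distributions, and only destroys the ordering, which is why it serves as a reference for the correlations.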

There are two ways to construct a time series from such documents:

1. Take a document of N words. Select all the different words. Count
the frequency f of appearance of each word in the document. The role
of the time of "appearance" is played by the rank position of the
word in the file. We map the word frequency to a time series f(t).
Such a time series is called the frequency time series (FTS).

2. Take a document of N words. Consider the length l (number of
letters) of each word. Record where each word of length l is located
in the text; the role of time is played by the position of the word
in the document, i.e. the first word is considered to be emitted at
time t = 1, the second at time t = 2, etc. A time series l(t) is so
constructed. We refer to such a time series as the length time
series (LTS).
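
The two mappings can be sketched in a few lines of Python. This is only one possible reading of the definitions above: the helper names, the tokenization rule (runs of letters, case folded), and the choice of f(t) as the whole-document count of the word at position t are assumptions of the sketch.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: words are runs of letters; case is folded.
    return re.findall(r"[^\W\d_]+", text.lower())

def length_time_series(words):
    """l(t): number of letters of the t-th word (t = 1, 2, ...)."""
    return [len(w) for w in words]

def frequency_time_series(words):
    """f(t): number of occurrences, in the whole document, of the t-th word."""
    counts = Counter(words)
    return [counts[w] for w in words]

words = tokenize("the cat and the hat and the bat")
print(length_time_series(words))     # [3, 3, 3, 3, 3, 3, 3, 3]
print(frequency_time_series(words))  # [3, 1, 2, 3, 1, 2, 3, 1]
```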

Obviously there is a large number of ways to map a 
text to a time series, but in the present study we only 
consider the above two since some physical meaning can 
be thought to arise in the mapping. As indicated in [13] 
the length of the word is associated with speaker effort, 
meaning that the longer the word the higher the effort 
required to pronounce it. The frequency of the word is 
also associated with the hearer effort as frequently used 
words require less effort to be understood by the hearer. 
FIG. 2: D(q) for (a) FTS (b) LTS of the shuffled indicated texts

FIG. 4: f(α) for (a) FTS (b) LTS of the shuffled indicated texts






These time series are thereby analyzed along the mul- 
tifractal ideas, for which we briefly recall the formulae of 
interest in order to set the notations. 



A. Multifractal Analysis 

Let the (LTS or FTS) time series have N data points (words, here), i.e.
y_i (1 ≤ i ≤ N).

Transform the series as follows: if the length of a word in LTS (or its
frequency in FTS) is smaller than that of the next one, the former word
gets the value 2; if it is greater, it gets the value 1; and if both are
equal.

The new series is called M_i (1 ≤ i ≤ N - 1). The series M_i is cut into
N_s subseries of size s, where N_s is the integer part of N/s. The
ordering starts from the beginning of the text (contrary to analyses in
which some forecasting is expected, and for which the end "points" are
more relevant).
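
A minimal Python sketch of this symbolization and of the cutting into subseries follows. The function names are hypothetical, and `tie_value` is a placeholder: the value assigned when two successive entries are equal is not recoverable from the text above, so 0 is used here purely as an assumption. The number of blocks is taken from the length of the symbol series itself.

```python
def to_symbols(y, tie_value=0):
    """Map y_1..y_N to M_1..M_{N-1}: 2 if y_i < y_{i+1}, 1 if y_i > y_{i+1}."""
    symbols = []
    for current, following in zip(y, y[1:]):
        if current < following:
            symbols.append(2)
        elif current > following:
            symbols.append(1)
        else:
            symbols.append(tie_value)  # assumption: value for ties is not given
    return symbols

def cut_into_blocks(m, s):
    """Cut m into consecutive blocks of size s, starting from the beginning."""
    n_blocks = len(m) // s  # leftover points at the end are dropped
    return [m[k * s:(k + 1) * s] for k in range(n_blocks)]

m = to_symbols([3, 1, 2, 3, 1, 2, 3, 1])
print(m)                      # [1, 2, 2, 1, 2, 2, 1]
print(cut_into_blocks(m, 3))  # [[1, 2, 2], [1, 2, 2]]
```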

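Next one calculates the probability

$$ P(s,\nu) = \frac{\sum_{i=1}^{s} M_{(\nu-1)s+i}}{\sum_{\nu=1}^{N_s} \sum_{i=1}^{s} M_{(\nu-1)s+i}} $$   (1)

for every ν and s. Thereafter one calculates

$$ \chi(s,q) = \sum_{\nu=1}^{N_s} P(s,\nu)^{q} $$   (2)

for each s value. A power law behaviour is expected,

$$ \chi(s,q) \sim s^{\tau(q)} , $$   (3)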

where τ(q) plays the role of the partition function [1]. The generalized
fractal dimension D(q) [1] [2] is defined from

$$ D(q) = \frac{\tau(q)}{q-1} $$   (4)

and the generalized Hurst exponent, h(q), from

$$ h(q) = \frac{1+\tau(q)}{q} . $$   (5)

Let

$$ \alpha = \frac{d\tau(q)}{dq} , $$   (6)

from which one obtains the generalized critical exponent f(α) curve [44]
through the Legendre transform of Eq. (7).



II. RESULTS 

The results of the FTS and LTS multifractal analysis for the three main
texts and their corresponding shuffled ones are shown in Figs. 1-4 (a-b).

A mere perusal of the graphs indicates that the multifractal approach is
in good order, e.g. since D(q) is not a single point, and should allow
one to observe interesting LRO correlations and local ones.



A. D(q) plots: Figs. 1-2

In FTS, the generalized fractal dimension has a similar set of values for
both English texts, decaying from ca. 1.2 to 1.0 for q increasing but
negative; D(q) decays slowly for q positive, barely reaching a value 0.95
for q = 80 (Fig. 1). The value of D(q) is much greater along the negative
q axis for AWLesp, but is identical to the other two for q > 0. In LTS,
even though the form of D(q) is that to be expected, it has to be
stressed that AWLeng and AWLesp are quantitatively very similar, but
markedly differ from TLGeng. This already indicates that one can observe
the high creativity of the author through these two books. Moreover the
translation effect on style is much better seen on FTS than on LTS.

The shuffled texts (Fig. 2) remarkably have the same D(q) values, their
range and variations being similar to those of the real texts. Slight
quantitative differences occur, more markedly for the shuffled AWLesp
FTS, but along a Bayesian reasoning these can be attributed to the finite
size of the sample.

By the way,

$$ C_1 = \left. \frac{d\tau(q)}{dq} \right|_{q=1} $$   (8)

is a measure of the intermittency lying in the signal y(n); it can be
numerically estimated by measuring around q = 1. In all cases the value
of C_1 is close to unity. Some conjecture on the role/meaning of C_1 is
found in Ref. [45]. From some financial and political data analysis it
seems that H_1 is a measure of the information entropy of the system. The
same can be thought of here.
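
As an illustration of how the derivative of τ(q) at q = 1 might be estimated numerically, here is a minimal sketch using a central difference. The function name `intermittency_c1` and the step `dq` are assumptions; `tau` stands for any callable returning τ(q), e.g. an interpolation of fitted values.

```python
def intermittency_c1(tau, dq=1e-3):
    """Central-difference estimate of C_1 = d tau / d q taken at q = 1."""
    return (tau(1.0 + dq) - tau(1.0 - dq)) / (2.0 * dq)

# Sanity check on a monofractal, for which tau(q) = q - 1 and hence C_1 = 1.
print(intermittency_c1(lambda q: q - 1.0))
```

The monofractal check is consistent with the observation that the measured values are close to unity.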



B. f(α) plots: Figs. 3-4



$$ f(\alpha) = q\alpha - \tau(q) . $$   (7)

In the present work, we have calculated χ(s,q) for s between 2 and 200.
The τ(q) values were calculated by a linear best fit on a log-log plot of
χ(s,q) versus s, for q values ranging from -25 to 25. As one may expect,
the q and q - 1 values in the denominators of Eqs. (4)-(5) lead to
numerical singularities. A smooth interpolation can be visually made
without difficulty. For conciseness we do not show χ(s,q).
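
The computation described in this paragraph can be sketched in plain Python as follows. This is an illustrative sketch, not the code used for the figures: all function names are hypothetical, blocks with zero weight are skipped (an assumption, needed for q < 0), and α is obtained by a simple finite difference of the fitted τ(q) values.

```python
import math

def box_probabilities(m, s):
    """P(s, nu): share of the total sum carried by each block of size s."""
    n_blocks = len(m) // s
    sums = [sum(m[k * s:(k + 1) * s]) for k in range(n_blocks)]
    total = sum(sums)
    # Assumption: empty blocks are skipped, since 0 ** q diverges for q < 0.
    return [x / total for x in sums if x > 0]

def partition_function(m, s, q):
    """chi(s, q): sum over blocks of P(s, nu) ** q."""
    return sum(p ** q for p in box_probabilities(m, s))

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def tau(m, q, sizes):
    """tau(q): slope of log chi(s, q) versus log s."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(partition_function(m, s, q)) for s in sizes]
    return fit_slope(xs, ys)

def spectrum(m, qs, sizes):
    """List of (q, alpha, f) with alpha = d tau / d q and f = q alpha - tau."""
    taus = [tau(m, q, sizes) for q in qs]
    points = []
    for k in range(1, len(qs) - 1):
        alpha = (taus[k + 1] - taus[k - 1]) / (qs[k + 1] - qs[k - 1])
        points.append((qs[k], alpha, qs[k] * alpha - taus[k]))
    return points

# Sanity check on a uniform series: tau(q) = q - 1, so alpha = f = 1.
m = [1] * 1024
sizes = [2, 4, 8, 16, 32, 64]
print(tau(m, 2.0, sizes))
print(spectrum(m, [-2.0, -1.0, 0.0, 1.0, 2.0], sizes))
```

With the ranges quoted in the text one would pass `sizes=range(2, 201)` and q values between -25 and 25; D(q) and h(q) then follow from τ(q), except at q = 1 and q = 0 respectively, where the denominators vanish.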



The f(α) spectra are shown in Figs. 3-4. They are markedly non-symmetric,
as was found for DLA [14] [15], with very high positive skewness, i.e.
for q < 0. Interestingly, the Esperanto text curve behaves differently
from the English texts in FTS, though TLGeng is different in the LTS
case; the f(α) spectra of the shuffled texts behave in a very similar
qualitative and quantitative way. However the shuffling does not fully
symmetrize the spectra.

It is interesting to observe that the f(α) curve is very sharp: it
originates from negative values for α less than 1.0; reaches a maximum
(= 1.0) at 1.0, at the maximum so called box dimension, and decays
rapidly for α above 1.0; f(α) = 0 at α = 1.2 and 1.3 respectively for
AWLeng and TLGeng; the maximum is also reached at (1.0, 1.0) and the
spectrum spans the (narrow) interval 0.90-1.25 for AWLesp on the
f(α) = 0 line. It is worth noticing that the values are reasonable in
view of their correspondence to the fractal dimensions. On the other hand
the sharpness indicates a high lack of uniformity of the texts' LROC.



Series   D(q)             D^s(q)           f(α)             f^s(α)

FTS      AWLeng, TLGeng   AWLeng, TLGeng   AWLeng, TLGeng   AWLeng, TLGeng
LTS      AWLeng, AWLesp   AWLeng, TLGeng   AWLeng, AWLesp   AWLeng, TLGeng

TABLE I: Comparing original texts with quasi identical behaviors through
the functions D(q) and f(α), and their counterparts for the shuffled (s)
cases, i.e. D^s(q) and f^s(α).



III. DISCUSSION WITH CONCLUSION 

In summary, one can observe similarities between the original and
shuffled texts and their translations; see Table I for a summary of the
similarities seen through D(q) and f(α). The English texts look more
similar to each other than to the Esperanto translation. On the other
hand, one physical conclusion arises from the above: the existence of a
multifractal spectrum found for the examined texts indicates a
multiplicative process, in the usual statistical sense, for the
distribution of word lengths and frequencies in the text considered as a
time series. Thus linguistic signals may indeed be considered as the
manifestation of a complex system of high dimensionality, different from
random signals or systems of low dimensionality such as the financial and
geophysical (climate) signals. Finally, the f(α) curve represents the
measurable aspects of the word networks, be they considered through LTS
or FTS. Our work confirms that texts could indeed be seen as networks
[25].

Before suggesting a physical model describing the construction of a
writing, let us consider implications from the f(α) spectrum in some
detail. The not fully parabolic, to say the least, f(α) curve indicates
non-uniformity and strong LROC between long words and small words,
evidently arising from strong short range correlations between these. In
some sense this is expected for classical writings. It is usually known
that the left (right) hand side of the f(α) curve corresponds to
fluctuations of the q > 0 (q < 0)-correlation function. In other words,
they correspond to fluctuations in small (large) word distributions.
Therefore the distribution of small and long words should be examined in
further work in order to observe these local correlations, e.g. through a
detrended fluctuation analysis.

Moreover, in order to characterize the writings, texts and/or authors, we
propose a very rough approximation/model, i.e. let us consider (assume !)
that the writings are made of only two types of words, small and large
[46] [47], appearing through some recursive process. In so doing one can
consider the behavior of the atypical f(α) curve as originating from a
binomial cascade of short and long words, on a support [0,1], with an
arbitrary contraction ratio r_i and a weight w_i for the word in each
successive subinterval, as for a multifractal Cantor set construction
[1]. For an arbitrary number n of subintervals the generalized fractal
dimension (or rather τ(q)) is obtained from

$$ \sum_{i=1}^{n} w_i^{q} \, r_i^{-\tau(q)} = 1 . $$   (9)

The formula is easily generalized for random contraction ratios and
weights. However it simplifies for the case of a simple binomial cascade,
i.e.

$$ w_1^{q} \, r_1^{-\tau(q)} + w_2^{q} \, r_2^{-\tau(q)} = 1 . $$   (10)

Whence the extremal α values read α_- = log(w_2)/log(r_2) and
α_+ = log(w_1)/log(r_1), from which the weights and ratios can be
estimated by inversion (Table II), thereby suggesting the author's
somewhat systematic way of writing.

                   Q      α_-    α_+    w_1    w_2    r_1    r_2

Original texts
AWLeng FTS         5.71   0.95   1.19   0.96   0.04   0.97   0.03
AWLesp FTS         4.39   0.94   1.30   0.99   0.01   0.99   0.01
TLGeng FTS         5.71   0.95   1.19   0.99   0.01   0.99   0.01
AWLeng LTS         4.65   0.92   1.23   0.89   0.11   0.91   0.09
AWLesp LTS         4.83   0.92   1.21   0.87   0.13   0.89   0.11
TLGeng LTS         3.94   0.92   1.34   0.96   0.04   0.97   0.03

Shuffled texts
AWLeng FTS         6.94   0.95   1.13   0.89   0.11   0.90   0.10
AWLesp FTS         6.57   0.96   1.16   0.97   0.03   0.97   0.03
TLGeng FTS         6.59   0.94   1.13   0.82   0.18   0.84   0.16
AWLeng LTS         4.35   0.91   1.25   0.88   0.12   0.90   0.10
AWLesp LTS         4.56   0.92   1.24   0.90   0.10   0.92   0.08
TLGeng LTS         4.35   0.91   1.25   0.88   0.12   0.90   0.10

TABLE II: Tsallis Q-parameter, derived from the α_- and α_+ values read
on Figs. 3-4, from Eq. (11), for the original texts and for the
corresponding shuffled or translated texts, according to the type of
series; the weights and ratios of the binomial cascade approximation (see
text) are given.
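
The inversion step can be illustrated with a short Python sketch. It is not necessarily the procedure used to fill the table: it assumes the normalizations w_1 + w_2 = 1 and r_1 + r_2 = 1 (which the tabulated values satisfy), and it solves the two extremal relations by bisection on r_1. The function names are hypothetical.

```python
import math

def extremal_alphas(w1, r1):
    """(alpha_-, alpha_+) of a binomial cascade with w2 = 1 - w1, r2 = 1 - r1."""
    alpha_minus = math.log(1.0 - w1) / math.log(1.0 - r1)
    alpha_plus = math.log(w1) / math.log(r1)
    return alpha_minus, alpha_plus

def invert_alphas(alpha_minus, alpha_plus, tol=1e-12):
    """Estimate (w1, r1) from the extremal alphas by bisection on r1."""
    def mismatch(r1):
        w1 = r1 ** alpha_plus  # enforces alpha_+ = log(w1) / log(r1)
        # Remaining condition: log(1 - w1) = alpha_- * log(1 - r1).
        return math.log1p(-w1) - alpha_minus * math.log1p(-r1)

    low, high = 1e-6, 1.0 - 1e-6
    if mismatch(low) * mismatch(high) > 0:
        raise ValueError("no sign change: cannot bracket a solution")
    while high - low > tol:
        mid = 0.5 * (low + high)
        if mismatch(low) * mismatch(mid) <= 0:
            high = mid
        else:
            low = mid
    r1 = 0.5 * (low + high)
    return r1 ** alpha_plus, r1

# Forward check with tabulated weights and ratios (w1 = 0.89, r1 = 0.91).
print(extremal_alphas(0.89, 0.91))
# Inverse estimate from alpha_- = 0.92 and alpha_+ = 1.23.
print(invert_alphas(0.92, 1.23))
```

Since the α values are read off the figures to two decimals, the recovered weights and ratios should only be expected to agree with tabulated ones to about that precision.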

The physics connection can be obtained if one relates the f(α) curve
extremal points through their physical meanings [IS], i.e.

$$ \frac{1}{1-Q} = \frac{1}{\alpha_+} - \frac{1}{\alpha_-} , $$   (11)

where α_- and α_+ are the extremes of the range of support for the
(positive) multifractal spectrum f(α) and Q (instead of the usual q) is
used to represent the parameter arising in the non-extensive description
of statistical physics [in]; see values in Table II. By extension it is a
measure of the attractor dimension or the number of so called degrees of
freedom. It is obvious from the table that Q varies between 4 and 7, with
interesting differences between the LTS and FTS cases, LTS's Q being
systematically smaller, in the original or shuffled texts. Notice that
the value of Q is more extreme, though of the same order of magnitude, in
the case of the Esperanto text for both types of series.
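
As a quick numerical check, the Q values can be recomputed from the extremes of the spectrum. The sketch below assumes the relation in the form 1/(Q - 1) = 1/α_- - 1/α_+, a sign convention chosen here because it reproduces the tabulated values; the function name is hypothetical.

```python
def tsallis_q(alpha_minus, alpha_plus):
    """Q from the spectrum extremes, assuming 1/(Q-1) = 1/alpha_- - 1/alpha_+."""
    return 1.0 + 1.0 / (1.0 / alpha_minus - 1.0 / alpha_plus)

# Extremes read for three of the original texts (two-decimal values).
for label, a_minus, a_plus in [("AWLeng FTS", 0.95, 1.19),
                               ("AWLesp FTS", 0.94, 1.30),
                               ("TLGeng LTS", 0.92, 1.34)]:
    print(label, round(tsallis_q(a_minus, a_plus), 2))
```

A narrower spectrum (α_+ closer to α_-) gives a larger Q, which is why the shuffled FTS cases, with their narrower spectra, show the largest values.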

Finally we re-emphasize the remarkable difference between the Esperanto
text (Fig. 3a) and the English texts in the FTS analysis. Linguistics
input should be searched for at this level and is left for further
discussion. The origin of the differences between TLG and AWL needs more
work at the linguistic level. However we have indicated the interest of
the multifractal scheme in providing a measure of these correlations,
thus a new measure of an author's style. This suggests a (binomial, at
first) cascade model containing parameters characterizing (or reflecting,
at least) the text style, and most likely in fine the author's writings.
It remains to be seen whether the f(α) curve and the (to be generalized)
binomial cascade model, with the weight and ratio parameters, hold in
other cases, and can characterize authors and texts, and time series in
general. Moreover the multifractal method should additionally be able to
distinguish a natural language signal from a computer code signal [43]
and help in improving translations by suggesting perfection criteria and
indicators of text qualitative values.